SMAuC -- The Scientific Multi-Authorship Corpus
With an ever-growing number of new publications each day, scientific writing
poses an interesting domain for authorship analysis of both single-author and
multi-author documents. Unfortunately, most existing corpora lack either
material from the science domain or the required metadata. Hence, we present
SMAuC, a new metadata-rich corpus designed specifically for authorship analysis
in scientific writing. With more than three million publications from various
scientific disciplines, SMAuC is the largest openly available corpus for
authorship analysis to date. It combines a wide and diverse range of scientific
texts from the humanities and natural sciences with rich and curated metadata,
including unique and carefully disambiguated author IDs. We hope SMAuC will
contribute significantly to advancing the field of authorship analysis in the
science domain.
The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry.
Comment: SIGIR 2023 resource paper, 13 pages
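The corpus statistics quoted above already admit some back-of-the-envelope ratios. The following sketch derives them purely from the figures stated in the abstract; the ratios assume a uniform distribution across providers and pages, so they are illustrative only.

```python
# Figures taken verbatim from the AQL abstract (first version of the corpus).
QUERIES = 356_000_000    # archived queries
SERPS = 166_000_000      # search result pages
RESULTS = 1_700_000_000  # individual search results
PROVIDERS = 550          # distinct search providers

# Average number of results per archived result page.
results_per_serp = RESULTS / SERPS

# Average number of archived queries per search provider,
# assuming (unrealistically) an even split across providers.
queries_per_provider = QUERIES / PROVIDERS

print(f"results per SERP:     {results_per_serp:.1f}")
print(f"queries per provider: {queries_per_provider:,.0f}")
```

The roughly ten results per page recovers the classic "ten blue links" layout, which is a quick sanity check that the corpus counts are mutually consistent.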
Evaluating Generative Ad Hoc Information Retrieval
Recent advances in large language models have enabled the development of
viable generative information retrieval systems. A generative retrieval system
returns a grounded generated text in response to an information need instead of
the traditional document ranking. Quantifying the utility of these types of
responses is essential for evaluating generative retrieval systems. As the
established evaluation methodology for ranking-based ad hoc retrieval may seem
unsuitable for generative retrieval, new approaches for reliable, repeatable,
and reproducible experimentation are required. In this paper, we survey the
relevant information retrieval and natural language processing literature,
identify search tasks and system architectures in generative retrieval, develop
a corresponding user model, and study its operationalization. This theoretical
analysis provides a foundation and new insights for the evaluation of
generative ad hoc retrieval systems.
Comment: 14 pages, 5 figures, 1 table
STEREO: Scientific Text Reuse in Open Access Publications
We present the Webis-STEREO-21 dataset, a massive collection of Scientific
Text Reuse in Open-access publications. It contains more than 91 million cases
of reused text passages found in 4.2 million unique open-access publications.
Featuring a high coverage of scientific disciplines and varieties of reuse, as
well as comprehensive metadata to contextualize each case, our dataset
addresses the most salient shortcomings of previous ones on scientific writing.
Webis-STEREO-21 allows for tackling a wide range of research questions from
different scientific backgrounds, facilitating both qualitative and
quantitative analysis of the phenomenon as well as a first-time grounding on
the base rate of text reuse in scientific publications.
Comment: 14 pages, 3 figures, 4 tables
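The two headline figures of the abstract already yield a first, coarse density estimate. The sketch below computes the average number of reuse cases per publication from the stated totals; note this is only a mean, not the base rate itself, which would require the per-publication case counts from the dataset.

```python
# Totals taken verbatim from the Webis-STEREO-21 abstract.
REUSE_CASES = 91_000_000    # cases of reused text passages
PUBLICATIONS = 4_200_000    # unique open-access publications covered

# Mean number of reuse cases per covered publication.
# A proper base-rate estimate would need the full per-document
# distribution, which the dataset itself provides.
cases_per_publication = REUSE_CASES / PUBLICATIONS

print(f"average reuse cases per publication: {cases_per_publication:.1f}")
```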
Shared Tasks as Tutorials: A Methodical Approach
In this paper, we discuss the benefits and challenges of shared tasks as a teaching method. A shared task is a scientific event and a friendly competition to solve a research problem, the task. In terms of linking research and teaching, shared-task-based tutorials fulfill several faculty desires: they leverage students' interdisciplinary and heterogeneous skills, foster teamwork, and engage them in creative work that has the potential to produce original research contributions. Based on ten information retrieval (IR) courses taught with shared tasks as tutorials at two universities since 2019, we derive a domain-neutral process model that captures the structure of such tutorials. Our teaching method has since been adopted by other universities, not only in IR courses but also in other areas of AI such as natural language processing and robotics.